Importing required packages

Importing and understanding the dataset

There are no missing values or duplicate rows in the dataset
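These checks can be sketched as follows; the frame here is a toy stand-in, since the real notebook runs them on the loaded dataset:

```python
import pandas as pd

# Toy frame standing in for the loaded dataset.
df = pd.DataFrame({"Age": [25, 45, 45], "Income": [49, 34, 34]})

missing_per_column = df.isnull().sum()  # NaN count per column
duplicate_rows = df.duplicated().sum()  # number of fully duplicated rows
```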

ID- customer ID, so this column is unlikely to be a significant factor in prediction
Age- minimum age is 23 and average age of a customer is 45
Experience- minimum experience is -3, which looks like a data-entry error, so this column needs pre-processing (replace negative values with their absolute value)
Income- minimum income of 8K looks very low- this column needs pre-processing
ZIPCode- this can be used to figure out the location, column looks ok
Family- minimum of 1 could mean the customer is single, column looks ok
CCAvg- minimum of 0 could mean the customer does not have or use their credit card, column looks ok
Education- 1: did not complete a college degree, 2: completed a college degree, 3: completed a professional degree, column looks ok
Mortgage- average mortgage is 56K, column looks ok
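The Experience fix noted above can be sketched like this (toy values; the column name is taken from the notes):

```python
import pandas as pd

df = pd.DataFrame({"Experience": [-3, 0, 20]})  # toy values including the erroneous -3

# Replace negative experience values (likely data-entry errors) with absolute values.
df["Experience"] = df["Experience"].abs()
```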

Exploratory Data Analysis

Univariate Analysis

Age

Experience

Income

Family

CCAvg

Education

Mortgage

Personal_Loan

Securities_Account

CD_Account

Online

CreditCard

Bivariate Analysis

Multivariate Analysis

Observations/Summary from EDA

Data Description:

  1. There are no missing values in the dataset
  2. There are no duplicate rows in the dataset
  3. There are 5000 rows and 14 columns in the dataset
  4. All columns are of datatypes int64 or float64

Observations from EDA:

  1. Experience: 52 negative values in this column- could be a data entry issue.
  2. Income: distribution is skewed to the right- which means there are more customers with lower annual income and
    a few customers with high annual income. Looks like there are also a few outliers in the Income column.
  3. Family: Customers with family size = 1 (single) are the most in the dataset, followed by family size = 2.
    29% of customers are single, 25% of the customers are a couple
  4. CCAvg: distribution is skewed to the right- which means most customers have a lower monthly credit card
    expense. 241 customers spend more than 6K on average per month. There are some outliers in this column.
  5. Education: The largest group of customers, ~42%, did not finish an undergraduate degree.
    28% of customers have an undergraduate degree and 30% of the customers have a professional degree.
  6. Mortgage: 69% of the customers do not have a mortgage.
  7. Personal_Loan: 90% of the customers do not have a personal loan. Percent of customers who took a personal
    loan is 9.6%.
  8. Securities_Account: 89.5% of the customers do not have a securities account.
  9. CD_Account: 94% of the customers do not have a CD account.
  10. Online: 3000 customers have an online presence and 2000 customers do not.
  11. CreditCard: 1500 customers have a credit card with another bank and 3500 customers do not have a credit card
    with another bank.
  12. Extremely high correlation between Age and Experience- will lead to multicollinearity.
  13. Customers who have taken the personal loan have an income greater than 50K.
  14. No apparent relationship between Age, Experience, Family, CCAvg, Online, CreditCard and personal loan.
  15. Fewer customers have taken the personal loan who have Education level as 1, compared to levels 2 and 3.
  16. Customers who don't have a securities account mostly do not have a personal loan.
  17. Most customers who have a CD account also have a personal loan.
  18. Customers who have Education level 1 have higher income.
  19. CCAvg and Income have fairly high correlation ~0.6.

Objective

To predict whether a liability customer will buy a personal loan or not
Which variables are most significant
Which segment of customers should be targeted more

Outlier Detection

Outliers in Income, CCAvg, Mortgage columns
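One common way to flag these outliers is the 1.5 * IQR boxplot rule; a minimal sketch on toy values (the notebook may use boxplots instead):

```python
import pandas as pd

def iqr_outliers(s: pd.Series) -> pd.Series:
    """Flag values outside the 1.5 * IQR whiskers (the standard boxplot rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

s = pd.Series([10, 12, 11, 13, 12, 200])  # 200 is an obvious outlier
mask = iqr_outliers(s)
```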

Data Preprocessing

Experience

Income

Zipcode

Removing 'None' in County

CCAvg

Mortgage

Since many customers do not have a mortgage, it would be best to create a column that indicates whether a
customer has a mortgage or not
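A sketch of that binary flag (toy values; the column name Mortgage_Bin matches the one used later in the model iterations):

```python
import pandas as pd

df = pd.DataFrame({"Mortgage": [0, 155, 0, 83]})  # toy values

# 1 if the customer has any mortgage, 0 otherwise.
df["Mortgage_Bin"] = (df["Mortgage"] > 0).astype(int)
```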

Income_Log and CCAvg_Log have a couple of outliers- we can drop these rows
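The log columns referenced above can be created roughly as follows; toy values, and using log1p for CCAvg is an assumption here to handle zero spend:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Income": [8, 45, 224], "CCAvg": [0.0, 1.5, 8.9]})  # toy values

df["Income_Log"] = np.log(df["Income"])  # Income minimum is 8, so a plain log is safe
df["CCAvg_Log"] = np.log1p(df["CCAvg"])  # log1p handles CCAvg values of 0
```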

Feature Engineering/Selection

Personal_Loan: Dependent Variable

ID: Just a unique ID for each customer, will not help with prediction. Can be dropped.
Income: Created a log transformed income column to remove skewness. Can be dropped.
ZipCode: Used this to come up with county information for each customer. Can be dropped.
CCAvg: Created a log transformed CCAvg column to remove skewness. Can be dropped.
Mortgage: Created a column to give a yes or no classification for Mortgage. Can be dropped.
City: too many cities to create dummy variables, will use county instead. Can be dropped.
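The drops listed above amount to something like this (toy frame with only the affected columns):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2], "Income": [49, 80], "ZIPCode": [91107, 90089],
    "CCAvg": [1.6, 2.5], "Mortgage": [0, 155], "City": ["Pasadena", "LA"],
    "Income_Log": [3.9, 4.4],  # engineered replacement kept in the frame
})

# Drop the raw columns that have been replaced by engineered versions.
df = df.drop(columns=["ID", "Income", "ZIPCode", "CCAvg", "Mortgage", "City"])
```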

Split Data
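A sketch of the split; stratifying on the target preserves the roughly 10% positive rate in both partitions (the 70/30 split and random seed are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in; the real notebook splits the preprocessed dataframe.
X = pd.DataFrame({"Income_Log": range(100)})
y = pd.Series([0] * 90 + [1] * 10, name="Personal_Loan")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)
```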

Model Evaluation Criteria

Model can make wrong predictions as:

  1. Predicting a customer will buy a personal loan when they do not
  2. Predicting a customer will not buy a personal loan when they actually do

Which case is more important? In this case, predicting a customer will buy a personal loan when they do not is bad,
because the bank would assume a certain profit level based on the incorrect prediction.
Predicting a customer will not buy when they end up buying is also unfavorable,
because the bank then has a situation it did not account for.

Since both scenarios are bad, I will maximize the f1 score, as a higher f1 score means a better chance of identifying both classes correctly.

Defining Functions for Error Metrics and Confusion Matrix
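A sketch of such a helper (function name and shape are assumptions; the notebook's versions may also plot the matrix):

```python
from sklearn.metrics import confusion_matrix, f1_score

def metrics_summary(y_true, y_pred):
    """Return the f1 score and the confusion matrix [[TN, FP], [FN, TP]]."""
    return f1_score(y_true, y_pred), confusion_matrix(y_true, y_pred)

f1, cm = metrics_summary([0, 0, 1, 1], [0, 1, 1, 1])
```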

Logistic Regression (with sklearn library)
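A minimal version of the fit-and-score step, on synthetic data standing in for the preprocessed loan features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic, linearly separable data in place of the real features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
train_f1 = f1_score(y, clf.predict(X))
```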

Checking Model Performance on Training Set

Checking Model Performance on Test Set

Observations

The training and test f1 scores are close (0.72 and 0.73), which shows that the model generalizes well.
We have built a logistic regression model that performs well on both the train and test sets.
However, to identify the significant variables we will have to build a logistic regression model using the statsmodels library.

Logistic Regression (with statsmodels library) Model

There are some counties for which Personal_Loan has only one class (0)- this causes the singular matrix error

Data Processing- County

Logistic Regression (with statsmodels library) Model

Iteration 1- Having all Features

Observations

  1. Columns that are not significant- Mortgage_Bin, Age, Experience, County_Santa Clara_county
  2. Multicollinearity could be affecting the p-values.
  3. A positive coefficient means the probability of a customer taking the loan increases as the
    corresponding attribute value increases.
  4. A negative coefficient means the probability of a customer taking the loan decreases as the
    corresponding attribute value increases.

Multicollinearity
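Multicollinearity is commonly quantified with variance inflation factors (VIF); a value above ~5-10 flags a problem. A sketch on synthetic data that mimics the Age/Experience collinearity noted in the EDA:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
age = rng.normal(45, 10, size=500)
experience = age - 21 + rng.normal(scale=0.5, size=500)  # nearly collinear with age
income = rng.normal(70, 20, size=500)                    # independent

X = pd.DataFrame({"Age": age, "Experience": experience, "Income": income})
X = X.assign(const=1.0)  # VIF needs an intercept column
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
```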

Iteration 2- Dropping Age

Iteration 3- Dropping Income_log

Iteration 4- Dropping Income_log and Age

Iteration 5- Dropping Income_log and Experience

Iteration 6- Dropping Income_log, Age and ZIPCode

Iteration 7- Adding constant

Iteration 8- Adding constant and removing Age

Iteration 9- Adding constant and removing Age and County_Other

Best Model

Going with iteration 8 as the best model, since it has the highest f1_score

ROC-AUC on training set

Recall has increased but the other metrics have become worse, so overall model performance has gone down
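For reference, ROC-AUC is computed from the predicted probabilities, not the hard labels; it is the fraction of (negative, positive) pairs the model ranks correctly. A toy example:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probability of class 1

# 3 of the 4 (negative, positive) pairs are ordered correctly.
auc = roc_auc_score(y_true, y_prob)
```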

Precision Recall curve (to see if there is a better threshold)

Balance of recall and precision at 0.42
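The mechanics of picking a threshold from the precision-recall curve can be sketched as below; the 0.42 balance point quoted above comes from the actual model, while these toy scores are illustrative:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.35, 0.4, 0.45, 0.55, 0.7, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
# The final precision/recall pair has no threshold, hence the [:-1] slices.
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1])
best_threshold = thresholds[np.argmax(f1_scores)]
```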

Model is performing well

Checking performance on Test set

Model Performance Summary

Decision Tree Model

Importing required packages

Defining functions to calculate f1_score and to create the confusion matrix

The model classifies every data point in the training set correctly- 0 training errors.
Such a perfect fit on the training data suggests the tree may be overfitting.

Not a big disparity between the f1_score on test and training data.

Visualizing the Decision Tree

According to the decision tree model, Income is the most important feature for prediction

Using GridSearch for Hyperparameter tuning of our tree model
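A sketch of the search, on synthetic data; the parameter grid here is a guess at the kind of grid used, not the notebook's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data in place of the loan features.
X, y = make_classification(n_samples=300, random_state=0)

params = {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), params, scoring="f1", cv=5
).fit(X, y)
best_tree = search.best_estimator_
```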

The model is giving a more generalized result now

Observations from the tree

Income is still the most important feature followed by Family

Cost Complexity Pruning

alpha = 0.005
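Applying that alpha via sklearn's ccp_alpha parameter; larger alphas prune more aggressively, shrinking the tree (synthetic data again in place of the loan features):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
# ccp_alpha=0.005 is the value selected above from the pruning-path analysis.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.005).fit(X, y)
```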